df <- read_csv("ESS11.csv")
## Rows: 40156 Columns: 640
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): name, proddate, cntry, cntbrthd, lnghom1, lnghom2, fbrncntc, mbr...
## dbl (616): essround, edition, idno, dweight, pspwght, pweight, anweight, nw...
## dttm (15): inwds, ainws, ainwe, binwe, cinwe, dinwe, einwe, finwe, hinwe, i...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
First of all, I will weight all the variables and decide which one to keep for further detect association rules if the target variable is hincfel, which is the feeling about household’s income nowadays.
There are some direct relevance variables, including employment status, education, and maritial or family context which often shape household income and perceptions of economic comfort. Besides, the subjective indicator such as life satisfaction, happines, trust and satisfaction with the economy can be strong correlates in the financial well-being perception. Additionally, some demographic and social dimension should also be considered, like age, gender, health, household structure.
sub_df <- df %>% select(
cntry,
agea,
gndr,
eisced,
maritalb,
hhmmb,
mainact,
stflife,
happy,
ppltrst,
stfeco,
gincdif,
health,
domicil,
sclmeet,
vote,
hincfel)
str(sub_df)
## tibble [40,156 × 17] (S3: tbl_df/tbl/data.frame)
## $ cntry : chr [1:40156] "AT" "AT" "AT" "AT" ...
## $ agea : num [1:40156] 65 21 53 78 64 59 77 69 52 75 ...
## $ gndr : num [1:40156] 1 2 2 2 1 2 2 2 2 2 ...
## $ eisced : num [1:40156] 3 5 6 5 3 4 2 7 5 2 ...
## $ maritalb: num [1:40156] 1 6 1 4 1 1 5 1 1 1 ...
## $ hhmmb : num [1:40156] 2 1 3 1 2 2 1 2 3 2 ...
## $ mainact : num [1:40156] 66 66 66 66 66 66 66 6 66 66 ...
## $ stflife : num [1:40156] 8 9 10 7 9 8 8 8 8 8 ...
## $ happy : num [1:40156] 8 9 9 7 9 8 7 8 8 7 ...
## $ ppltrst : num [1:40156] 5 10 6 6 6 8 7 8 7 3 ...
## $ stfeco : num [1:40156] 6 2 6 4 6 4 6 6 6 7 ...
## $ gincdif : num [1:40156] 2 1 1 1 2 2 2 2 1 2 ...
## $ health : num [1:40156] 3 2 1 3 2 1 2 2 2 2 ...
## $ domicil : num [1:40156] 3 1 3 1 4 4 3 4 4 4 ...
## $ sclmeet : num [1:40156] 4 7 4 6 5 6 4 4 4 3 ...
## $ vote : num [1:40156] 1 1 1 2 1 1 1 1 1 1 ...
## $ hincfel : num [1:40156] 1 2 1 2 2 1 2 1 2 2 ...
summary(sub_df)
## cntry agea gndr eisced
## Length:40156 Min. : 15.00 Min. :1.000 Min. : 1.000
## Class :character 1st Qu.: 37.00 1st Qu.:1.000 1st Qu.: 3.000
## Mode :character Median : 53.00 Median :2.000 Median : 4.000
## Mean : 58.06 Mean :1.533 Mean : 4.532
## 3rd Qu.: 67.00 3rd Qu.:2.000 3rd Qu.: 6.000
## Max. :999.00 Max. :2.000 Max. :99.000
## maritalb hhmmb mainact stflife
## Min. : 1.000 Min. : 0.000 Min. : 1.00 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.:66.00 1st Qu.: 6.000
## Median : 2.000 Median : 2.000 Median :66.00 Median : 8.000
## Mean : 3.702 Mean : 2.942 Mean :59.66 Mean : 7.784
## 3rd Qu.: 6.000 3rd Qu.: 3.000 3rd Qu.:66.00 3rd Qu.: 9.000
## Max. :99.000 Max. :99.000 Max. :99.00 Max. :99.000
## happy ppltrst stfeco gincdif
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. :1.000
## 1st Qu.: 7.000 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.:1.000
## Median : 8.000 Median : 5.000 Median : 5.000 Median :2.000
## Mean : 7.741 Mean : 5.361 Mean : 5.935 Mean :2.168
## 3rd Qu.: 9.000 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.:3.000
## Max. :99.000 Max. :99.000 Max. :99.000 Max. :9.000
## health domicil sclmeet vote
## Min. :1.000 Min. :1.00 Min. : 1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.00 1st Qu.: 4.000 1st Qu.:1.000
## Median :2.000 Median :3.00 Median : 5.000 Median :1.000
## Mean :2.154 Mean :2.96 Mean : 4.941 Mean :1.382
## 3rd Qu.:3.000 3rd Qu.:4.00 3rd Qu.: 6.000 3rd Qu.:2.000
## Max. :9.000 Max. :9.00 Max. :99.000 Max. :9.000
## hincfel
## Min. :1.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.921
## 3rd Qu.:2.000
## Max. :9.000
Check how many countries are there and inspect hincfel column
table(sub_df$cntry)
##
## AT BE CH CY DE ES FI FR GB GR HR HU IE IS IT LT
## 2354 1594 1384 685 2420 1844 1563 1771 1684 2757 1563 2118 2017 842 2865 1365
## NL NO PL PT RS SE SI SK
## 1695 1337 1442 1373 1563 1230 1248 1442
table(sub_df$hincfel, useNA = "ifany")
##
## 1 2 3 4 7 8 9
## 14309 18108 5862 1426 299 133 19
Checking for missing values and empty strings
isna <- colSums(is.na(sub_df))
cat("Missing values by column:\n")
## Missing values by column:
isna
## cntry agea gndr eisced maritalb hhmmb mainact stflife
## 0 0 0 0 0 0 0 0
## happy ppltrst stfeco gincdif health domicil sclmeet vote
## 0 0 0 0 0 0 0 0
## hincfel
## 0
Since the main activity variables has more than 36 thousands missing value, I will remove it from the analysis, also, rows with missing value will be removed.
sub_df <- sub_df %>% select (-mainact)
sub_df <- sub_df %>% filter(!if_any(everything(), is.na))
Rows with refusal, don’t know and no answer will be removed from column hincfel.
For all other columns, refusal answer, don’t know and know answer will be removed as I think it will not bring value to the result.
sub_df <- sub_df %>%
mutate(
hincfel = if_else(hincfel %in% c(7,8,9), NA_real_, hincfel),
agea = if_else(agea == 999, NA_real_, agea),
gndr = if_else(agea == 9, NA_real_, gndr),
eisced = if_else(eisced %in% c(0, 55, 77,88,99), NA_real_, eisced),
maritalb = if_else(maritalb %in% c(77,88,99), NA_real_, maritalb),
hhmmb = if_else(hhmmb %in% c(77,88,99), NA_real_, hhmmb),
stfeco = if_else(stfeco %in% c(77,88,99), NA_real_, stfeco),
happy = if_else(happy %in% c(77,88,99), NA_real_, happy),
stflife = if_else(stflife %in% c(77,88,99), NA_real_, stflife),
ppltrst = if_else(ppltrst %in% c(77,88,99), NA_real_, ppltrst),
gincdif = if_else(gincdif %in% c(7,8,9), NA_real_, gincdif),
health = if_else(health %in% c(7,8,9), NA_real_, health),
domicil = if_else(domicil %in% c(7,8,9), NA_real_, domicil),
sclmeet = if_else(sclmeet %in% c(77,88,99), NA_real_, sclmeet),
vote = if_else(vote %in% c(7,8,9), NA_real_, vote))
sub_df <- sub_df %>%
mutate(
agea = case_when(
agea < 30 ~ "Under 30",
agea >= 30 & agea < 50 ~ "30 to 50 years old",
agea >= 50 & agea < 65 ~ "50 to 64 years old",
agea >= 65 ~ "65 plus"),
gndr = case_when(
gndr == 1 ~ "Male",
gndr == 2 ~ "Female"),
eisced = case_when(
eisced == 1 ~ "Primary Education",
eisced == 2 ~ "Lower Secondary",
eisced == 3 ~ "Upper Secondary",
eisced == 4 ~ "Post-Secondary, Non-Tertiary",
eisced == 5 ~ "Short-cycle Tertiary",
eisced == 6 ~ "Bachelor's or Equivalent",
eisced == 7 ~ "Master's or Equivalent",
eisced == 8 ~ "Doctoral or Equivalent"),
maritalb = case_when(
maritalb == 1 ~ "Legally married",
maritalb == 2 ~ "In a legally registered",
maritalb == 3 ~ "Legally separated",
maritalb == 4 ~ "Legally divorced",
maritalb == 5 ~ "Widowed",
maritalb == 6 ~ "Never married or in legally registered"),
hhmmb = case_when(
hhmmb < 2 ~ "Less then 2 in a family",
hhmmb >=2 & hhmmb < 3 ~ "2 to 3 family members",
hhmmb <= 3 ~ "more than 3 family members"),
stfeco = case_when(
stfeco <= 3 ~ "Not satisfied w country ecno",
stfeco > 3 & hhmmb < 6 ~ "Satisfy w country ecno",
stfeco >=6 ~ "Very satisfy w country ecno"),
happy = case_when(
happy <= 3 ~ "Not happy",
happy > 3 & happy < 6 ~ "Feeling so-so",
happy >=6 ~ "Happy"),
stflife = case_when(
stflife <= 3 ~ "Not satisfied w life",
stflife > 3 & stflife < 6 ~ "Life is ok",
stflife >=6 ~ "Satisfied w life"),
ppltrst = case_when(
ppltrst <= 3 ~ "can't be too careful w ppl",
ppltrst > 3 & ppltrst < 6 ~ "can be trust",
ppltrst >=6 ~ "can trust most people"),
gincdif = case_when(
gincdif < 3 ~ "inc diff shoud reduce",
gincdif == 3 ~ "inc diff is ok",
gincdif > 3 ~ "inc diff shoud not reduce"),
health = case_when(
health < 3 ~ "good health",
health == 3 ~ "health ok",
health > 3 ~ "bad health"),
domicil = case_when(
domicil == 1 ~ "big city",
domicil == 2 ~ "Suburbs/outskirts of big city",
domicil == 3 ~ "Town/small city",
domicil == 4 ~ "Country village",
domicil == 5 ~ "Farm/home in countryside"),
sclmeet = case_when(
sclmeet <= 3 ~ "rarely socialize",
domicil <= 5 ~ "sometimes socialize",
domicil >5 ~ "often socialize"),
vote = case_when(
vote == 1 ~ "voted",
vote == 2 ~ "did not vote",
vote == 3 ~ "not eligible to vote"),
hincfel = case_when(
hincfel == 1 ~ "Living comfortably on prs income",
hincfel == 2 ~ "Coping on prs income",
hincfel == 3 ~ "Difficult on prs income",
hincfel == 4 ~ "Very difficult on prs income"))
sub_df <- na.omit(sub_df)
Convert data to transactions for association rules
df_items <- lapply(seq_len(nrow(sub_df)), function(i) {
row_i <- sub_df[i, ]
paste0(names(sub_df), "=", row_i)})
library(arules)
trans <- as(df_items, "transactions")
summary(trans)
## transactions as itemMatrix in sparse format with
## 24141 rows (elements/itemsets/transactions) and
## 78 columns (items) and a density of 0.2051282
##
## most frequent items:
## happy=Happy stflife=Satisfied w life
## 20679 19495
## sclmeet=often socialize vote=voted
## 19010 18791
## gincdif=inc diff shoud reduce (Other)
## 18116 290165
##
## element (itemset/transaction) length distribution:
## sizes
## 16
## 24141
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16 16 16 16 16 16
##
## includes extended item information - examples:
## labels
## 1 agea=30 to 50 years old
## 2 agea=50 to 64 years old
## 3 agea=65 plus
labels_old <- itemLabels(trans)
labels_new <- sub(".*=", "", labels_old)
itemLabels(trans) <- labels_new
freq <- itemFrequency(trans)
freq_sorted <- sort(freq, decreasing=TRUE)
head(freq_sorted, 10)
## Happy Satisfied w life often socialize
## 0.8565925 0.8075473 0.7874570
## voted inc diff shoud reduce good health
## 0.7783853 0.7504246 0.6498902
## 2 to 3 family members Female Legally married
## 0.5655938 0.5228035 0.5084711
## Male
## 0.4771965
itemFrequencyPlot(trans, topN=20, type="absolute", main="Top frequent items")
From the plot, it seem like many people are happy and satisfied with their life as they are the most frequently occurring items, suggesting a significant portion of the population reports a positive outlook or well being.
Also, lots of people would like to socialize often, and many of them did vote at the last national election.
rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5, minlen = 2), appearance = list(default = "lhs", rhs = c("Living comfortably on prs income", "Coping on prs income", "Difficult on prs income","Very difficult on prs income")), control=list(verbose=F))
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted))
## lhs rhs support confidence coverage lift count
## [1] {Master's or Equivalent,
## Male,
## Happy,
## good health,
## Legally married,
## can trust most people,
## often socialize,
## Satisfied w life,
## voted} => {Living comfortably on prs income} 0.01226130 0.8629738 0.01420819 2.420477 296
## [2] {Master's or Equivalent,
## Male,
## Happy,
## good health,
## 2 to 3 family members,
## Legally married,
## can trust most people,
## often socialize,
## Satisfied w life} => {Living comfortably on prs income} 0.01035583 0.8620690 0.01201276 2.417940 250
## [3] {Master's or Equivalent,
## Male,
## Happy,
## good health,
## Legally married,
## can trust most people,
## often socialize,
## voted} => {Living comfortably on prs income} 0.01234414 0.8612717 0.01433246 2.415703 298
## [4] {Master's or Equivalent,
## Male,
## Happy,
## good health,
## 2 to 3 family members,
## Legally married,
## can trust most people,
## often socialize} => {Living comfortably on prs income} 0.01052152 0.8610169 0.01221987 2.414989 254
## [5] {Master's or Equivalent,
## Male,
## good health,
## 2 to 3 family members,
## Legally married,
## can trust most people,
## often socialize,
## Satisfied w life} => {Living comfortably on prs income} 0.01039725 0.8595890 0.01209560 2.410984 251
## [6] {Master's or Equivalent,
## Male,
## good health,
## 2 to 3 family members,
## Legally married,
## can trust most people,
## often socialize} => {Living comfortably on prs income} 0.01064579 0.8595318 0.01238557 2.410823 257
plot(rules, method = "grouped")
There are total of 38673 rules were mined by using Apriori algorithmm that has the length from 2 to 10 items. The highest number of rules (10,016) have a length of 6 items, followed by rules with lengths 5 (6,549) and 7 (9,489) that indicates that the data supports rules with multiple antecedents and consequences.
Since there are many rules that have low support range, meaning they occur only in a few transactions, so I will clean the rules by remove the redundant one.
strong_rules <- rules[!is.redundant(rules)]
strong_rules <- strong_rules[(is.maximal(strong_rules))]
summary(strong_rules)
## set of 2381 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6 7 8 9 10
## 23 200 541 716 534 228 92 47
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 6.186 7.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01002 Min. :0.5000 Min. :0.01189 Min. :1.101
## 1st Qu.:0.01040 1st Qu.:0.5260 1st Qu.:0.01806 1st Qu.:1.165
## Median :0.01106 Median :0.5517 Median :0.02005 Median :1.227
## Mean :0.01179 Mean :0.5797 Mean :0.02070 Mean :1.407
## 3rd Qu.:0.01218 3rd Qu.:0.6031 3rd Qu.:0.02220 3rd Qu.:1.629
## Max. :0.03086 Max. :0.8630 Max. :0.05712 Max. :2.420
## count
## Min. :242.0
## 1st Qu.:251.0
## Median :267.0
## Mean :284.7
## 3rd Qu.:294.0
## Max. :745.0
##
## mining info:
## data ntransactions support confidence
## trans 24141 0.01 0.5
## call
## apriori(data = trans, parameter = list(supp = 0.01, conf = 0.5, minlen = 2), appearance = list(default = "lhs", rhs = c("Living comfortably on prs income", "Coping on prs income", "Difficult on prs income", "Very difficult on prs income")), control = list(verbose = F))
plot(strong_rules, method = "grouped")
After reduce the redundant rules, I have 2381 rules in total. In the next step, I will inspect the rule for different categories in feeling about household’s income nowadays.
comfy_rules <- subset(strong_rules, rhs %in% "Living comfortably on prs income")
inspectDT(comfy_rules )
plot(comfy_rules, method = "graph", control = list(type = "items"))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
plot(comfy_rules,
method = "scatterplot",
measure = "support",
shading = "lift",
engine = "plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
A prominent factor, indicating that individuals who report good health are more likely to feel comfortable with their income. It might reflects the financial stability associated with fewer healthcare-related expenses.
Suprisingly, male appears to correlate strongly with the feeling of living comfortably on present income which can reflect gender income disparities or societal norms influencing financial perception.
Those who are legally married associciated with better financial comfort due to the possibility of shared financial resources with their partner.
Besides, family with 2 to 3 members are strongly linked to financial comfort, maybe due to the reduce financial strain. Individuals from 50 to 64 years old often experience peak earning years or financial stability, that contribute to higher comfort level.
Regarding geographical factor, the Netherland and Sweden feature prominently, reflecting higher comfort levels in countries with strong welfare systems and economic stability.
coping_rules <- subset(strong_rules, rhs %in% "Coping on prs income")
inspectDT(coping_rules)
plot(coping_rules, method = "graph", control = list(type = "items"), main = "Obesity rules ")
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
library(arulesViz)
plot(coping_rules, method = "graph", engine = "htmlwidget")
“Coping” often indicates a mid-range situation that is not truly comfortable, but not severely struggling with the present income. So items like “more than 3 family members” or “2 to 3 family members” might reflect a moderate household size—big enough to cause some financial pressure but not extreme.
“Rarely socialize” can be correlated with sense of control or financial constraints. It might reflect less disposable income for leisure or less inclination to go out—so one might interpret that as an indicator of moderate financial caution.
People are satisfy with country economy but still “Coping” might mean that personal finances are borderline, though they consider the overall economy is not too bad.
People in Lithuania, Hungary, Poland and Finland appear most frequent in this category suggests these countries have a considerable share of respondents who find themselves neither fully comfortable nor severely struggling with their income. While Finland has a well-developed welfare state, other factors like personal debt or higher living costs could constrain how “comfortable” individuals feel.
There is no rules found for people that feel difficult on the present income
There is no rules found for people that feel very difficult on the present income
From both groups who feel comfortable with the current income and those who coping with it, many of them state that the government should reduce the income differences which can better enhance financial well-being of household.
The lack of discernible rules for difficult or very difficult financial situations may indicate the sample sizes for those categories or the respondents in severe financial straits share fewer common traits captured by the variables in this dataset.
Overall, the association rules underscore how health, demographic such as gender, marital status, age, and national context all interplay to shape subjective financial comfort.